43 research outputs found

    Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data

    Dramatic improvements in high-throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross-validation study we show that the integrated model increases recall by 18% at 50% precision, compared with using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement over PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
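The neighbor assumption stated in the abstract above can be illustrated with a minimal sketch. This is a plain neighbor-voting baseline, not the authors' probabilistic model; the function name `neighbor_vote`, the toy graph, and the placeholder term labels (`GO:A`, `GO:B`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def neighbor_vote(graph, annotations, protein):
    """Score each term by the fraction of annotated neighbors that carry it.

    graph:       dict mapping a protein to the set of its graph neighbors
    annotations: dict mapping an annotated protein to its set of terms
    """
    # Only neighbors with known annotations can contribute evidence.
    neighbors = [n for n in graph.get(protein, ()) if n in annotations]
    if not neighbors:
        return {}
    counts = Counter(term for n in neighbors for term in annotations[n])
    return {term: c / len(neighbors) for term, c in counts.items()}


# Hypothetical toy data: P1 is unannotated, P4 has no known function either.
graph = {"P1": {"P2", "P3", "P4"}}
annotations = {"P2": {"GO:A"}, "P3": {"GO:A", "GO:B"}}

scores = neighbor_vote(graph, annotations, "P1")
```

Here both annotated neighbors of `P1` carry `GO:A` and one of the two carries `GO:B`, so `GO:A` receives the higher score. An integrated predictor of the kind the abstract describes would combine such graph evidence with the other data sources rather than rely on it alone.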

    Integration of relational and hierarchical network information for protein function prediction

    Background: In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top-to-bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is to apply transitive closure to the predictions. Results: We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with a computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing. Conclusion: A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is uniformly consistent across GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, one that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
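The transitive-closure post-processing that the abstract above contrasts with its own method can be sketched as follows. This is only an illustration of the general idea (every predicted term also implies all of its ancestors); the helper names `ancestors` and `enforce_true_path` and the toy hierarchy are hypothetical, and the abstract's own classifier is described as not needing this step.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a hierarchy given as child -> parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def enforce_true_path(predicted, parents):
    """Close a set of predicted terms upward so it is 'true-path' consistent."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


# Hypothetical toy hierarchy: a term may have more than one parent.
parents = {"leaf": {"mid1", "mid2"}, "mid1": {"root"}, "mid2": {"root"}}

closed = enforce_true_path({"leaf"}, parents)
```

A prediction of `leaf` alone is expanded to include `mid1`, `mid2` and `root`, so no term is predicted without its ancestors.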

    iPSCORE: A Resource of 222 iPSC Lines Enabling Functional Characterization of Genetic Variation across a Variety of Cell Types.

    Large-scale collections of induced pluripotent stem cells (iPSCs) could serve as powerful model systems for examining how genetic variation affects biology and disease. Here we describe the iPSCORE resource: a collection of systematically derived and characterized iPSC lines from 222 ethnically diverse individuals that allows for both familial and association-based genetic studies. iPSCORE lines are pluripotent with high genomic integrity (no or low numbers of somatic copy-number variants) as determined using high-throughput RNA-sequencing and genotyping arrays, respectively. Using iPSCs from a family of individuals, we show that iPSC-derived cardiomyocytes demonstrate gene expression patterns that cluster by genetic background, and can be used to examine variants associated with physiological and disease phenotypes. The iPSCORE collection contains representative individuals for risk and non-risk alleles for 95% of SNPs associated with human phenotypes through genome-wide association studies. Our study demonstrates the utility of iPSCORE for examining how genetic variants influence molecular and physiological traits in iPSCs and derived cell lines.

    Current Performance and On-Going Improvements of the 8.2 m Subaru Telescope

    An overview of the current status of the 8.2 m Subaru Telescope, constructed and operated at Mauna Kea, Hawaii, by the National Astronomical Observatory of Japan, is presented. The basic design concept and the verified performance of the telescope system are described. Also given are the status of the instrument package offered to the astronomical community, the status of operation, and some of the future plans. The status of the telescope as reported in a number of SPIE papers as of the summer of 2002 is incorporated, with some updates as of February 2004. However, readers are encouraged to check the most recent status of the telescope through the home page, http://subarutelescope.org/index.html, and/or by contacting the observatory staff directly. Comment: 18 pages (17 pages in published version), 29 figures (GIF format). This is the version before the galley proof.

    Clustering of Lyman Break Galaxies at z=4 and 5 in The Subaru Deep Field: Luminosity Dependence of The Correlation Function Slope

    We explored the clustering properties of Lyman Break Galaxies (LBGs) at z=4 and 5 with an angular two-point correlation function on the basis of the very deep and wide Subaru Deep Field data. We found an apparent dependence of the correlation function slope on UV luminosity for LBGs at both z=4 and 5. More luminous LBGs have a steeper correlation function. To compare with these observational results, we constructed numerical mock LBG catalogs based on a semianalytic model of hierarchical clustering combined with high-resolution N-body simulation, carefully mimicking the observational selection effects. The luminosity functions for LBGs predicted by this mock catalog were found to be almost consistent with the observations. Moreover, the overall correlation functions of LBGs were reproduced reasonably well. The observed dependence of the clustering on UV luminosity was not reproduced by the model, unless subsamples of distinct halo mass were considered. That is, LBGs belonging to more massive dark halos had steeper and larger-amplitude correlation functions. With this model, we found that LBG multiplicity in massive dark halos amplifies the clustering strength at small scales, which steepens the slope of the correlation function. The hierarchical clustering model could therefore be reconciled with the observed luminosity dependence of the angular correlation function, if there is a tight correlation between UV luminosity and halo mass. Our finding that the slope of the correlation function depends on luminosity could be an indication that massive dark halos hosted multiple bright LBGs (abridged). Comment: 16 pages, 17 figures. Accepted for publication in ApJ. Full resolution version is available at http://zone.mtk.nao.ac.jp/~kashik/sdf/acf/sdf_lbgacf.pd

    A crowdsourced set of curated structural variants for the human genome.

    Funder: U.S. Food and Drug Administration; funder-id: http://dx.doi.org/10.13039/100000038. A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high-confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.